Materials: https://github.com/topepo/rstudio-conf-2018
The session will step through the process of building, visualizing, testing, and comparing models that are focused on prediction. The goal of the course is to provide a thorough workflow in R that can be used with many different regression or classification techniques. Case studies are used to illustrate the functionality.
The goal is to be able to easily build predictive/machine learning models in R using a variety of packages and model types. - “Models that are focused on prediction”: what does that mean? - “Machine learning”: so this is deep learning with massive data sets, right?
The course is broken up into sections for regression (predicting numeric outcome) and classification (predicting a category).
Whether the computations are done in C, C++, TensorFlow, Keras, Python, Stan, or Weka, you can access these applications without leaving R. However, the modeling interfaces are inconsistent. For example: - there are two main methods for specifying what terms are in a model; not all models have both. - 99% of model functions automatically generate dummy variables. - Sparse matrices can be used (unless they can’t).
| Function | Package | Code |
|---|---|---|
| lda | MASS | predict(obj) |
| glm | stats | predict(obj, type = "response") |
| gbm | gbm | predict(obj, type = "response", n.trees) |
| mda | mda | predict(obj, type = "posterior") |
| rpart | rpart | predict(obj, type = "prob") |
| Weka | RWeka | predict(obj, type = "probability") |
| logitboost | LogitBoost | predict(obj, type = "raw", nIter) |
There are two main philosophies to data analysis code that will be discussed in this workshop:
The more traditional approach uses high-level syntax and is perhaps the most untidy code that you will encounter.
caret is the primary package for untidy predictive modeling: 1. More traditional R coding style. 2. High-level “I do that for you” syntax. 3. More comprehensive (for now) and less modular. 4. Contains many optimizations and is easily parallelized.
The tidy modeling approach espouses the tenets of the tidyverse: 1. Reuse existing data structures. 2. Compose simple functions with the pipe. 3. Embrace functional programming. 4. Design for humans.
This approach is exemplified by packages such as: modelr, broom, recipes, rsample, yardstick and tidyposterior.
For regression problems, we will use the Ames, IA housing data. There are 2,930 properties in the data.
The sale price was recorded along with 81 predictors, including - Location (e.g. neighborhood) and lot information. - House components (garage, fireplace, pool, porch, etc.). - General assessments such as overall quality and condition. - Number of bedrooms, baths, and so on.
More details can be found in De Cock (2011, Journal of Statistics Education).
The raw data are at http://bit.ly/2whgsQM but we will use a processed version found in the AmesHousing package.
library(AmesHousing)
AmesHousing::ames_raw

ames_geo
#> Assuming "Longitude" and "Latitude" are longitude and latitude, respectively
The data that are used here are an extended version of the ubiquitous mtcars data set. fueleconomy.gov was used to obtain fuel efficiency data on cars from 2015-18.
Over this time range, duplicate ratings were eliminated; these occur when the same car is sold for several years in a row. As a result, there are 3,294 cars listed in the data. The predictors include the automaker and additional information about the cars (e.g. intake valves per cycle, aspiration method, etc.).
In our analysis, the data from 2015-2017 are used for training to see if we can predict the 609 cars that were new in 2018.
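The year-based split described above can be done with a simple dplyr filter; a minimal sketch, assuming the data live in a data frame called `car_data` with a numeric `year` column (both names are hypothetical, since the actual object names are not shown in these notes):

```r
library(dplyr)

# Train on the 2015-2017 model years; test on the cars new in 2018
car_train <- car_data %>% filter(year < 2018)
car_test  <- car_data %>% filter(year == 2018)
```

Splitting by time like this mimics how the model would actually be used: fit on past data, predict future data.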
These data are supplied in the GitHub repo.
OkCupid is an online dating site that serves international users. Kim and Escobedo-Land (2015, Journal of Statistics Education) describe a data set where over 50,000 profiles from the San Francisco area were made available by the company.
The data contains several types of fields:
We will try to predict whether someone has a profession in the STEM fields (science, technology, engineering, and math) using a random sample of the overall dataset.
Many tidyverse functions have syntax unlike base R code. For example:
contains("Sepal")
# instead of
c("Sepal.Width", "Sepal.Length")

merged <- inner_join(a, b)
# is equal to
merged <- a %>%
  inner_join(b)

(e.g., dplyr’s filter and select vs. base::subset).

library(tidyverse)
ames <- read_delim("http://bit.ly/2whgsQM", delim = "\t") %>%
rename_at(vars(contains(' ')), funs(gsub(' ', '_', .))) %>%
rename(Sale_Price = SalePrice) %>%
filter(!is.na(Electrical)) %>%
select(-Order, -PID, -Garage_Yr_Blt)
ames <- ames_raw %>%
rename_at(vars(contains(' ')), funs(gsub(' ', '_', .))) %>%
rename(Sale_Price = SalePrice) %>%
filter(!is.na(Electrical)) %>%
select(-Order, -PID, -Garage_Yr_Blt)
ames %>%
group_by(Alley) %>%
summarize(mean_price = mean(Sale_Price / 1000),
          n = sum(!is.na(Sale_Price)))

library(ggplot2)
ggplot(ames,
aes(x=Garage_Type,
y=Sale_Price))+
geom_violin()+
coord_trans(y="log10")+
xlab("Garage Type")+
ylab("Sale Price")

library(purrr)
# Summarize via purrr::map
by_alley <- split(ames, ames$Alley)
is_list(by_alley)
#> [1] TRUE

# glimpse(by_alley)
map(by_alley, nrow)
#> $Grvl
#> [1] 120
#>
#> $Pave
#> [1] 78

map_int(by_alley, nrow)
#> Grvl Pave
#>  120   78
# works on non-list vectors too
ames %>%
mutate(Sale_Price=Sale_Price %>%
map_dbl(function(x)x/1000)) %>%
select(Sale_Price, Yr_Sold) %>%
head()

To get warmed up, let’s load the Ames data and do some basic investigations into the variables, such as exploratory visualizations or summary statistics. The idea is to get a feel for the data.
library(AmesHousing)
ames <- make_ames()

Part 2 Basic Principles - Data Splitting, Models in R, Resampling, Tuning (rsample)
Part 3 Feature engineering preprocessing - Data treatment (recipes)
Part 4 Regression Modeling - Measuring Performance, penalized regression, multivariate adaptive regression splines (MARS), ensembles (yardstick, recipes, caret, earth, glmnet, tidyposterior, doParallel)
Part 5 Classification Modeling - Measuring Performance, trees, ensembles, naive Bayes (yardstick, recipes, caret, rpart, klaR, tidyposterior)
http://www.tidyverse.org/ R for Data Science Jenny’s purrr tutorial or Happy R Users Purrr Programming with dplyr vignette Selva Prabhakaran’s ggplot2 tutorial caret package documentation CRAN Machine Learning Task View
About these slides…. they were created with Yihui’s xaringan and the stylings are a slightly modified version of Patrick Schratz’s Metropolis theme.
In this section, we will introduce concepts that are useful for any type of machine learning model: - modeling versus the model - data splitting - resampling - tuning parameters and overfitting - model tuning
Many of these topics will be put into action in later sections.
Common steps during model building are:
Many books and courses portray predictive modeling as a short sprint. A better analogy would be a marathon or a campaign (depending on how hard the problem is).
How do we “spend” the data to find an optimal model? We typically split data into training and test data sets:
The more data we spend, the better estimates we’ll get (provided the data is accurate).
Given a fixed amount of data: - too much spent in training won’t allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (overfitting) - too much spent in testing won’t allow us to get a good assessment of model parameters
Statistically, the best course of action would be to use all the data for model building and use statistical methods to get estimates of error.
From a non-statistical perspective, many consumers of complex models emphasize the need for an untouched set of samples to evaluate performance.
When a large amount of data are available, it might seem like a good idea to put a large amount into the training set. Personally, I think that this causes more trouble than it is worth due to diminishing returns on performance and the added cost and complexity of the required infrastructure.
Alternatively, it is probably a better idea to reserve good percentages of the data for specific parts of the modeling process. For example:
Also there may be little need for iterative resampling of the data. A single holdout (aka validation set) may be sufficient in some cases if the data are large enough and the data sampling mechanism is solid.
There are a few different ways to do the split: simple random sampling, stratified sampling based on the outcome, by date, or methods that focus on the distribution of the predictors.
For stratification: - classification: this would mean sampling within the classes so as to preserve the distribution of the outcome in the training and test sets - regression: determine the quartiles of the data set and sample within those artificial groups
ames <- make_ames()
dim(ames)
#> [1] 2930   81
library(rsample)
# make sure you get the same random numbers
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price")
ames_train <- training(data_split)
ames_test <- testing(data_split)
nrow(ames_train) / nrow(ames)
#> [1] 0.7505119
library(ggplot2)
# Do the distributions line up?
ggplot(ames_train,aes(x=Sale_Price))+
geom_line(stat = "density",
trim = TRUE) +
geom_line(data = ames_test,
stat = "density",
trim = TRUE, col = "red")

To fit a model to the housing data, the model terms must be specified. Historically, there are two main interfaces for doing this.
The formula interface uses R formula rules to specify a symbolic representation of the terms and variables. For example:
foo(Sale_Price ~ Neighborhood + Year_Sold + Neighborhood:Year_Sold, data = ames_train)
# OR
foo(Sale_Price ~ ., data = ames_train)
# OR
foo(log10(Sale_Price) ~ ns(Longitude, df = 3) + ns(Latitude, df = 3), data = ames_train)

This is very convenient but it has some disadvantages.
foo(y ~ pca(scale(x1), scale(x2), scale(x3)), data = dat)

Some modeling functions have a non-formula interface. This usually has arguments for the predictors and the outcome(s):
# Usually, the variables must all be numeric
pre_vars <- c("Year_Sold","Longitude","Latitude")
foo(x = ames_train[, pre_vars],
    y = ames_train$Sale_Price)

This is inconvenient if you have transformations, factor variables, interactions, or any other operations to apply prior to modeling.
Overall, it is difficult to predict if a package has one or both of these interfaces. For example, lm only has formulas.
There is a third interface, using recipes, that will be discussed later and that solves some of these issues.
Let’s start by fitting an ordinary linear regression model to the training set. You can choose the model terms for your model but I will use a very simple model:
head(ames_train)

simple_lm <- lm(log10(Sale_Price) ~ Longitude + Latitude, data = ames_train)

Before looking at coefficients, we should do some model checking to see if there is anything obviously wrong with the model.
To get the statistics on the individual points, we will use the awesome broom package:
library(broom)
library(magrittr)
simple_lm_values <- augment(simple_lm)
simple_lm_values %>% names()
#>  [1] "log10.Sale_Price." "Longitude"         "Latitude"          ".fitted"
#>  [5] ".se.fit"           ".resid"            ".hat"              ".sigma"
#>  [9] ".cooksd"           ".std.resid"
From these results, let’s do some visualizations:
Are there any downsides to this approach?
If you use the summary method on the lm object, the bottom shows some statistics.
summary(simple_lm)
Call:
lm(formula = log10(Sale_Price) ~ Longitude + Latitude, data = ames_train)
Residuals:
Min 1Q Median 3Q Max
-1.01769 -0.09771 -0.01536 0.10003 0.57637
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -316.1528 15.0273 -21.04 <2e-16 ***
Longitude -2.0792 0.1346 -15.45 <2e-16 ***
Latitude 3.0135 0.1880 16.03 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1614 on 2196 degrees of freedom
Multiple R-squared: 0.1808, Adjusted R-squared: 0.1801
F-statistic: 242.3 on 2 and 2196 DF, p-value: < 2.2e-16
These statistics are the result of predicting the same data that were used to derive the coefficients. This is problematic because it can lead to optimistic results, especially for models that are extremely flexible.
The test set is used for assessing performance. Should we predict the test set and use those results to estimate these statistics?
NOPE!
Save the test set until the very end when you have one or two models that are your favorite. We’ll need to use the training set but…
For some models, it is possible to get very small residuals by predicting the training set. That’s an issue since we will need to make comparisons between models, create diagnostic plots, etc.
If only we had a method for getting honest performance estimates from the training set..
There are additional data splitting schemes that are applied to the training set. They attempt to simulate slightly different versions of the training set. These versions of the original are split into two model subsets.
This process is repeated many times. There are different flavors of resampling but we will focus on two methods.
- The analysis set is used to fit the model (analogous to the training set).
- Performance is determined using the assessment set.
Here, we randomly split the training data into V distinct blocks of roughly equal size. - We leave out the first block of analysis data and fit a model. - This model is used to predict the held-out block of assessment data. - We continue this process until we’ve predicted all V assessment blocks.
The final performance is based on the hold-out predictions, averaging the statistics from the V blocks. V is usually taken to be 5 or 10, and leave-one-out cross-validation has each sample as a block.
A bootstrap sample is the same size as the training set, but each data point is selected with replacement.
This means that the analysis set will have more than one replicate of a training set instance.
The assessment set contains all samples that were never included in the bootstrap set. It is often called the out-of-bag sample and can vary in size.
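rsample can generate these bootstrap resamples directly; a minimal sketch (the seed, the number of resamples, and the object name `boots` are arbitrary choices, not from the original notes):

```r
library(rsample)

set.seed(8888)
boots <- bootstraps(ames_train, times = 10)

# Each analysis set has as many rows as ames_train (sampled with replacement);
# the assessment (out-of-bag) set size varies from resample to resample.
nrow(analysis(boots$splits[[1]]))
nrow(assessment(boots$splits[[1]]))
```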
On average, about 63.2% of the training set is contained at least once in the bootstrap sample.
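The 63.2% figure comes from a simple probability argument: each of the n training rows appears at least once in a bootstrap sample of size n with probability 1 − (1 − 1/n)^n, which approaches 1 − e⁻¹ ≈ 0.632 as n grows. A quick check in base R (n = 2199 matches the Ames training set used here):

```r
# P(a given row is drawn at least once in n draws with replacement)
n <- 2199
p <- 1 - (1 - 1 / n)^n
p  # close to 1 - exp(-1), i.e. about 0.632
```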
If you think of resampling in the same manner as statistical estimators (e.g., maximum likelihood), this becomes a trade-off between bias and variance:
There are lengthy blog posts about this subject here and here.
I tend to favor 5 repeats of 10-fold cross-validation unless the size of the assessment data is “large enough”.
For example, 10% of the Ames training set is 219 properties and this is probably good enough to estimate the \(RMSE\) and \(R^2\).
library(AmesHousing)
library(rsample)
set.seed(2433)
cv_splits <- vfold_cv(ames_train, v=10, strata="Sale_Price")
cv_splits
#> # 10-fold cross-validation using stratification
The split objects contain the information about the sample size.
cv_splits$splits[[1]]
#> <1977/222/2199>
We use the analysis and assessment functions to get the data.
analysis(cv_splits$splits[[1]]) %>% dim()
#> [1] 1977   81
assessment(cv_splits$splits[[1]]) %>% dim()
#> [1] 222  81
We will need to write a function to fit the model to each data set and another to compute performance.
library(yardstick)
#> Attaching package: 'yardstick'
#> The following objects are masked from 'package:caret':
#>     mnLogLoss, precision, recall
#> The following object is masked from 'package:readr':
#>     spec
For performance, the first argument should be the rsplit object contained in cv_splits$splits:
model_perf <- function(data_split, mod_obj) {
vars <- rsample::form_pred(mod_obj$terms)
assess_dat <- assessment(data_split) %>%
select(!!!vars, Sale_Price) %>%
mutate(
pred = predict(
mod_obj,
newdata = assessment(data_split)
),
Sale_Price = log10(Sale_Price)
)
rmse <- assess_dat %>%
rmse(truth = Sale_Price, estimate = pred)
rsq <- assess_dat %>%
rsq(truth = Sale_Price, estimate = pred)
data.frame(rmse = rmse, rsq = rsq)
}

The purrr package will be used to fit the model to each analysis set. The fits will be saved in a column called lm_mod:
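The helper `lm_fit` and the formula `form` are used below but are not defined anywhere in these notes; a minimal sketch consistent with the simple model fit earlier would be:

```r
library(rsample)

# The formula for the simple longitude/latitude model
form <- log10(Sale_Price) ~ Longitude + Latitude

# Fit a linear model to the analysis set of a single rsplit;
# extra arguments (e.g., formula = form) are passed through to lm()
lm_fit <- function(data_split, ...)
  lm(..., data = analysis(data_split))
```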
library(purrr)
cv_splits <- cv_splits %>%
mutate(lm_mod=map(splits, lm_fit, formula=form))
cv_splits
#> # 10-fold cross-validation using stratification
Now, let’s compute the two performance measures:
# map2 can be used to iterate over two objects of equal length
library(dplyr)
lm_res <- map2_df(cv_splits$splits, cv_splits$lm_mod, model_perf) %>%
dplyr::rename(rmse_simple=rmse, rsq_simple=rsq)
lm_res %>% head()

# Merge in results:
cv_splits <- cv_splits %>% bind_cols(lm_res)
# Rename the columns and compute the resampling estimates:
cv_splits %>% select(rmse_simple, rsq_simple) %>% colMeans()
#> rmse_simple  rsq_simple
#>   0.1612427   0.1845147
Previously, I mentioned that the performance metrics that were naively calculated from the training set could be optimistic. However, that approach estimated the RMSE to be 0.1614, and cross-validation produced an estimate of 0.1613. What was the big deal?
Linear regression is a high bias model. This means that it is fairly incapable of adapting to the underlying model function (unless it is linear). For this reason, linear regression is unlikely to overfit the training set, and our two estimates are likely to be about the same.
We’ll consider another model shortly that is low bias since it can, theoretically, easily adapt to a wide variety of true model functions.
However, as before, there is also variance to consider. Linear regression is very stable since it leverages all of the data points to estimate parameters. Other methods, such as tree-based models, are not and can drastically change if the training set data is slightly perturbed.
tl;dr: the earlier concern is real but linear regression is less likely to be affected.
Now let’s look at diagnostics using the predictions from the assessment sets.
get_assessment <- function(splits, model)
augment(model, newdata=assessment(splits)) %>%
mutate(.resid=log10(Sale_Price)-.fitted)
holdout_results <- map2_df(cv_splits$splits, cv_splits$lm_mod, get_assessment)
holdout_results %>% dim()
#> [1] 2199   84
ames_train %>% dim()
#> [1] 2199   81
A partial residual plot is used to diagnose which variables should have been in the model. We can plot the hold-out residuals versus different variables to understand if they should have been in the model: - If the residuals show no pattern with a variable, it is likely to be irrelevant. - If a pattern is seen, it suggests that the variable should have been in the model.
Take 10 min and use ggplot to investigate other predictors using the holdout_results data frame. geom_smooth might come in handy.
Now let’s consider a more flexible model that is low bias: K-nearest neighbors. The model stores the training set (including the outcome). When a new sample is predicted, the \(K\) training set points that are most similar to the new sample are found.
The predicted value for the new sample is some summary statistic of the neighbors, usually: - the mean for regression, or - the mode for classification
When \(K\) is small, the model might be too responsive to underlying data. When \(K\) is large, it begins to “oversmooth” the neighbors and performance suffers.
Ordinarily, since we are computing a distance, we would want to center and scale the predictors. Our two predictors are already on the same scale, so we can skip this step.
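Had the predictors been on different scales, one common approach is to compute centering and scaling statistics on the training set and reuse them for new data; a sketch using base R (the `new_data` object is hypothetical, standing in for whatever data set is being predicted):

```r
# Center and scale using *training set* statistics only
train_x <- scale(ames_train[, c("Longitude", "Latitude")])

centers  <- attr(train_x, "scaled:center")
scalings <- attr(train_x, "scaled:scale")

# Apply the same training-set statistics to new samples,
# rather than re-estimating them, to avoid information leakage
new_x <- scale(new_data[, c("Longitude", "Latitude")],
               center = centers, scale = scalings)
```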
Consider the 2-nearest neighbor model. Would there be a difference in the estimated model performance between re-prediction and cross-validation?
caret has a knnreg function that can be used (the kknn package is another option). It has a formula method and we’ll use this to illustrate the model:
library(caret)
knn_train_mod <- knnreg(log10(Sale_Price)~Longitude+Latitude,
data=ames_train,
k=2)
repredict <- data.frame(price=log10(ames_train$Sale_Price)) %>%
mutate(pred=predict(knn_train_mod, newdata=ames_train %>% select(Longitude,Latitude)
)
)
repredict %>% yardstick::rsq(truth = price, estimate = pred)
That’s pretty good, but are we tricking ourselves? One of those two neighbors is always the point itself. To resample, let’s create another function to fit this model and follow the same resampling process as before:
knn_fit <- function(data_split,...)
knnreg(..., data=analysis(data_split))
cv_splits <- cv_splits %>%
mutate(knn_mod=map(splits, knn_fit, formula=form, k=2))
knn_res <- map2_df(cv_splits$splits, cv_splits$knn_mod, model_perf) %>%
rename(rmse_knn=rmse, rsq_knn=rsq)
# merge in results
cv_splits <- cv_splits %>% bind_cols(knn_res)
colMeans(knn_res)
#>  rmse_knn   rsq_knn
#> 0.1012993 0.6869034
The model appears to be a drastic improvement over simple linear regression, but we are definitely getting highly optimistic results by re-predicting the training set.
We can try to make a more formal assessment of the two current models. Both models used the same resamples, so we have 10 estimates of performance that are matched. Does the matching mean anything?
Most likely YES. It is very common to see that there is a resample effect. Similar to repeated measures designs, we can expect a relationship between models and resamples. For example, some resamples will have the worst performance over different models and so on.
In other words, there is usually a within-resample correlation. For the two models, the estimated correlation in RMSE values is 0.85.
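The within-resample correlation can be computed directly from the matched per-fold estimates (assuming the `rmse_simple` and `rmse_knn` columns merged into `cv_splits` above):

```r
# Correlation of the matched per-resample RMSE estimates
cor(cv_splits$rmse_simple, cv_splits$rmse_knn)
```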
With only two models, a paired t-test can be used to estimate the difference in RMSE between the models:
t.test(cv_splits$rmse_simple, cv_splits$rmse_knn, paired=TRUE)
Paired t-test
data: cv_splits$rmse_simple and cv_splits$rmse_knn
t = 21.265, df = 9, p-value = 5.282e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.05356677 0.06632000
sample estimates:
mean of the differences
0.05994338
Hothorn et al (2012) is the original paper on comparing models using resampling.
We’ll do more extensive analyses with tidyposterior soon.
Overfitting occurs when a model inappropriately picks up on trends in the training set that do not generalize to new samples. When this occurs, assessments of the model based on the training set can show good performance that does not reproduce in future samples.
Some models have specific “knobs” to control overfitting: - the neighborhood size in nearest neighbor models - the number of splits in a tree model
Often, poor choices for these parameters can result in overfitting.
For example, the next slide shows a data set with two predictors. We want to be able to produce a line (i.e., decision boundary) that differentiates two classes of data.
On the next slide, two classification boundaries are shown for a different model type not yet discussed. The difference in the two panels is solely due to different choices in tuning parameters. One overfits the training data.
We usually don’t have two-dimensional data, so a quantitative method for measuring overfitting is needed. Resampling fits that description. A simple method for tuning a model is grid search:
├── Create a set of candidate tuning parameter values
└── For each resample
│   ├── Split the data into analysis and assessment sets
│   ├── [preprocess data]
│   ├── For each tuning parameter value
│   │   ├── Fit the model using the analysis set
│   │   └── Compute the performance on the assessment set and save
├── For each tuning parameter value, average the performance over resamples
├── Determine the best tuning parameter value
└── Create the final model with the optimal parameter(s) on the training set
Random search is a similar technique where the candidate set of parameter values are simulated at random across a wide range. Also, an example of nested resampling can be found here.
The bad news is that all of the models (except the final model) are discarded. The good news is that all of the models (except the final model) can be run in parallel. Let’s look at the Ames K-NN model and evaluate \(K = 1, 2, \dots, 20\) using the same 10-fold cross-validation as before.
We’ll start coding this algorithm from the inside out.
These steps are:
├── Fit the model using the analysis set
└── Compute the performance on the assessment set and save
split will be one of the elements of cv_splits$splits.
knn_rmse <- function(k, split) {
mod <- knnreg(log10(Sale_Price) ~ Longitude + Latitude,
data = analysis(split),
k = k)
# Extract the names of the predictors
preds <- form_pred(mod$terms)
data.frame(Sale_Price = log10(assessment(split)$Sale_Price)) %>%
mutate(pred = predict(mod, assessment(split) %>% select(!!!preds))) %>%
rmse(Sale_Price, pred)
}

│   ├── For each tuning parameter value
│   │   └── Run knn_rmse
knn_grid <- function(split) {
# Create grid
tibble(k = 1:20) %>%
# Execute grid for this resample
mutate(
rmse = map_dbl(k, knn_rmse, split = split),
# Attach the resample indicators using `labels`
id = labels(split)[[1]]
)
}

The return value here is a tibble with columns for k, the RMSE, and the fold ID (e.g., Fold01).
└── For each resample
│   └── Run knn_grid
Here, resamp is the resample object cv_splits
iter_over_resamples <- function(resamp)
  map_df(resamp$splits, knn_grid)

library(rsample)
knn_tune_res <- iter_over_resamples(cv_splits)
knn_tune_res %>% head(15)

To summarize the results for each value of \(K\):
library(tidyverse)
rmse_by_k <- knn_tune_res %>%
group_by(k) %>%
summarize(rmse=mean(rmse))
ggplot(rmse_by_k, aes(x=k, y=rmse))+
geom_point() + geom_line()

Although it is numerically optimal, we are not required to use a value of 4 neighbors for the final model.
How stable is this? We can also plot the individual curves and their minimums.